Pipelining

The majority of the time, we are writing data to registers.

To execute just one instruction, we have to do:

  1. Fetch: Bring the instruction into the CPU
  2. Decode: Determine what the instruction is
  3. Execute: Perform the instruction
  4. Writeback: Write data to a register
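As a rough illustration, here is a toy Python sketch of how the four steps chain together for one instruction. It is not tied to any real ISA; the register values and the ADD R1, R2, R3 format are borrowed from the example later in these notes.

```python
# A toy model of the 4-step cycle; the instruction format mirrors the
# ADD R1, R2, R3 example used later in these notes.
regs = {"R1": 0, "R2": 5, "R3": 7}
instr_mem = ["ADD R1, R2, R3"]                 # R1 <- R2 + R3

def fetch(pc):
    return instr_mem[pc]                       # 1. bring the instruction into the CPU

def decode(instruction):
    opcode, rest = instruction.split(maxsplit=1)
    return opcode, [r.strip() for r in rest.split(",")]   # 2. what is this instruction?

def execute(opcode, ops):
    if opcode == "ADD":
        return regs[ops[1]] + regs[ops[2]]     # 3. perform the instruction
    raise NotImplementedError(opcode)

def writeback(dest, value):
    regs[dest] = value                         # 4. write data to a register

opcode, ops = decode(fetch(0))
writeback(ops[0], execute(opcode, ops))
print(regs)                                    # {'R1': 12, 'R2': 5, 'R3': 7}
```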

Suppose each step takes 1 nanosecond to execute.

F1    D1    E1    WB1
1ns   1ns   1ns   1ns

Now, suppose we want to add a second instruction:

F1    D1    E1    WB1   F2    D2    E2    WB2
1ns   1ns   1ns   1ns   1ns   1ns   1ns   1ns

Pipelining comes from noticing that once the machine is at D1, the hardware that was used for F1 is free and needlessly idling, so instead we can do this:

F1    D1    E1    WB1
      F2    D2    E2    WB2
1ns   1ns   1ns   1ns   1ns
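A minimal sketch, assuming k = 4 stages of 1 ns each, that reproduces the two timelines above by computing the total time with and without overlapping:

```python
STAGES = ["F", "D", "E", "WB"]
TAU_NS = 1  # assumed delay of each stage

def total_time(num_instructions, pipelined):
    k = len(STAGES)
    if pipelined:
        # the first instruction takes k steps; each extra one finishes 1 step later
        return (k + (num_instructions - 1)) * TAU_NS
    # without pipelining, instructions run back to back
    return num_instructions * k * TAU_NS

for n in (1, 2, 3):
    print(f"{n} instruction(s): {total_time(n, False)} ns sequential, "
          f"{total_time(n, True)} ns pipelined")
# 1 instruction(s): 4 ns sequential, 4 ns pipelined
# 2 instruction(s): 8 ns sequential, 5 ns pipelined
# 3 instruction(s): 12 ns sequential, 6 ns pipelined
```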

This is the execution of one instruction:

F → D → E → WB

Modular vs. Whole System: you can swap parts out in a modular system, but not in a whole system.

A real pipeline actually has overhead.

Here, we use buffers (latches) between the stages to control the flow of data from one stage to the next.

Doodle

Alternative: Async

Doodle

Speed

To calculate the speedup, we need to calculate two things (here k is the number of pipeline stages and \tau is the delay of one stage):

  1. Delay without pipelining.
  2. Delay with pipelining.
Doodle
# of Instructions    Delay without Pipelining    Delay with Pipelining
1                    k\tau                       k\tau
2                    2k\tau                      (k + 1)\tau
3                    3k\tau                      (k + 2)\tau
N                    Nk\tau                      (k + (N - 1))\tau

\text{Speedup} = \frac{ \text{Delay without pipelining} }{ \text{Delay with pipelining} } = \frac{ Nk\tau }{ (k + (N-1))\tau }

Speedup as N \to \infty:

\lim_{N \to \infty} \frac{ Nk }{ k + (N-1) } = k
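A quick numeric check of the formula, assuming k = 4 and \tau = 1, showing the speedup climbing toward k as N grows:

```python
def speedup(N, k, tau=1.0):
    without = N * k * tau              # N instructions, k stages each
    with_pipe = (k + (N - 1)) * tau    # fill the pipe once, then 1 stage per instruction
    return without / with_pipe

k = 4
for N in (1, 10, 100, 10_000):
    print(N, round(speedup(N, k), 3))
# 1 1.0
# 10 3.077
# 100 3.883
# 10000 3.999   -> approaches k = 4
```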

On Architectures

The Problem

F1    D1    E1    WB1
      F2    D2    E2    WB2
            F3    D3    E3    WB3

Imagine the first instruction is load R1, addr1. Its E1 stage needs to read data memory over the bus at the same time that F3 wants to fetch the third instruction, so the two steps compete for the same memory line.

Solutions

Bad Solution I: Introduce a delay to all instructions.

Bad Solution II: Introduce a delay to just F3 (NOP).

Better Solution: Harvard Architecture: separate instruction memory and data memory, each with its own line.

[Diagram: Harvard architecture]

Better Solution: Von Neumann Architecture: Instruction and data memory are the same.

[Diagram: von Neumann architecture]

Best Solution: BMW Architecture: Instruction and data memory are the same, but separate lines are provided for each.

[Diagram: BMW architecture]
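A hypothetical cycle count for the three-instruction example above, comparing a shared instruction/data line with separate lines; the one-cycle stall for F3 is an assumption for illustration, not a measured number.

```python
# Hypothetical cycle counts for: load R1, addr1 ; instr 2 ; instr 3
# (4-stage pipeline; we assume the shared-bus conflict costs F3 one cycle).
K, N = 4, 3

def total_cycles(separate_instruction_and_data_lines):
    base = K + (N - 1)        # ideal pipelined time for 3 instructions
    if separate_instruction_and_data_lines:
        return base           # F3 and the load's data access use different lines
    return base + 1           # F3 must wait for the single shared line

print("von Neumann (shared line):", total_cycles(False), "cycles")  # 7
print("Harvard / BMW (separate): ", total_cycles(True), "cycles")   # 6
```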

On the ALU

F1    D1    E1    WB1
      F2    D2    E2    WB2

Imagine the instructions are:

  1. ADD R1, R2, R3
  2. SUB R4, R1, R5

At E2, the second instruction wants to use R1, but the new value of R1 hasn't been written back to the register file yet; that only happens at WB1.

Bad Solutions: stall the pipeline so that E2 happens only after WB1, e.g. by issuing NOPs after the first instruction:

F1    D1    E1    WB1
      F2    D2    E2    WB2            (conflict: E2 needs R1 before WB1 finishes)

F1    D1    E1    WB1
      NOP
            NOP
                  F2    D2    E2    WB2

Better Solution: instead of waiting for R1 to be written just so it can be read again, you can short circuit the result inside the ALU to make it immediately available before WB1.

[Diagram: hotwiring the ALU result back to its input]
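A back-of-the-envelope sketch of the cost of the two approaches, assuming a 4-stage pipeline where two NOPs are enough to push E2 past WB1 (as drawn above) and where the short circuit removes the stall entirely:

```python
# Cycle counts for: ADD R1, R2, R3 ; SUB R4, R1, R5   (4-stage pipeline)
K = 4

def total_cycles(stall_cycles):
    # the first instruction takes K cycles; the second finishes one cycle
    # later, plus however many stall (NOP) cycles were inserted before it
    return K + 1 + stall_cycles

print("stall with 2 NOPs: ", total_cycles(2), "cycles")  # 7
print("short circuit (ALU):", total_cycles(0), "cycles")  # 5
```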

More on Limits

So far we've been dealing with a stable \tau.

But, what if a step takes 2\tau?

F       D       E       WB
\tau    \tau    2\tau   \tau

Why not just increase k (add more pipeline stages)?

Assorted Questions

Q: What is the physical limit of pipelining?

A: The latch (buffer) delay. Pipelining keeps paying off only while the stage delay is more than about 10 times the delay of the latch; split the stages any finer and the latch overhead starts to dominate.
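To see where that rule of thumb comes from, here is a rough sketch of how the achievable speedup saturates once the stage delay stops being much larger than the latch delay; the total logic delay T and latch delay d are made-up numbers, only their ratio matters.

```python
# T = total logic delay of one instruction, d = latch delay per stage
# (both are made-up numbers; only their ratio matters here).
T, d = 100.0, 1.0

for k in (2, 5, 10, 20, 50, 100):
    stage = T / k                  # logic delay of one stage
    clock = stage + d              # the latch overhead is paid every cycle
    speedup = T / clock            # gain over the unpipelined machine
    print(f"k={k:3d}  stage = {stage/d:5.1f} x latch  speedup = {speedup:5.1f} (ideal {k})")
# Around stage ~ 10 x latch the loss is ~10%; beyond that the latch dominates.
```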

Q: How do we fix the case above, where the E step takes 2\tau while F, D, and WB each take \tau?

A: Divide the execution stage into two.

F       D       E_1     E_2     WB
\tau    \tau    \tau    \tau    \tau
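A tiny sketch, with stage delays expressed in units of \tau, showing that the clock period is set by the slowest stage and that splitting E restores the 1\tau clock:

```python
# Stage delays in units of tau; the slowest stage sets the clock period.
def clock_period(stage_delays):
    return max(stage_delays)

unbalanced = {"F": 1, "D": 1, "E": 2, "WB": 1}
balanced   = {"F": 1, "D": 1, "E_1": 1, "E_2": 1, "WB": 1}

print("unbalanced:", clock_period(unbalanced.values()), "tau per cycle")  # 2
print("balanced:  ", clock_period(balanced.values()), "tau per cycle")    # 1
```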

Q: Okay, so what if in some cases, we don’t need E_2?

A: “You are using a perfect pipeline, and thus everything will be the same”, i.e. every instruction still passes through E_2 (it just does no useful work there), so the timing does not change.

Q: Why can we never have the ideal speedup?

A: Because of instruction dependency. Recall these two conflicts:

  1. The load example uses the bus while F3 wants to run.
  2. Doing R1 <- R2 + R3 followed by R4 <- R1 + R5 and having to account for R1 not being available.